The evolution of South Africa’s strategic policy directions was analysed through the application of natural language processing (NLP) techniques to the texts of the Reconstruction and Development Plan (RDP), Growth, Employment And Redistribution (GEAR), the Accelerated and Shared Growth Initiative South Africa (AsgiSA), the New Growth Path (NGP), and the National Development Plan (NDP).
All documents emphasise the role of governance and the public sector in economic development. RDP emphasises terms related to democratisation and reconstruction, with less importance given to issues related to economic growth. This changes in GEAR, where issues of fiscal policy and stability, together with employment-related issues, come to the fore. AsgiSA brings up terms related to projects and institutions, with additional emphasis on agriculture. NGP is more centrally concerned with economic growth, aspects of the green economy, and employment issues, particularly among youth. The most recent plan, NDP, is notable in that it has relatively fewer mentions of economic growth, employment, and real-sector terms. What sets NDP apart from previous development plans is its emphasis on health, the low-carbon economy, and corruption-related issues.
In the probabilistic topic modeling analysis, nine topics are identified across all development plans. In decreasing order of their share of the development plan documents, the topics relate to: climate change and resources; green economy; corruption and security; health; skills and training; economic growth; fiscal policy and macroeconomy; reconstruction and democracy; and education. Supporting the exploratory results, the topic modeling analysis suggests that health, climate change and resources, and corruption and security are more prominent in NDP than in other development plans. Skills and training is covered more in AsgiSA, and marginally more in NGP, than in other development plans. NGP also gives more prominence to the green economy topic than other development plans do. Fiscal policy and macroeconomy has higher coverage in GEAR. As expected, reconstruction and democracy is covered more in RDP. Analysis of the relationship between topics suggests that topics related to education and green economy are more likely to be covered in the same document. Similarly, the economic growth and the fiscal policy and macroeconomy topics tend to appear in the same development plans.
One important issue is the diversity of content in development plans over time. We observe a dramatic increase in the size of the plans over time. This is particularly true for NDP, which stands at 162,056 total words and 6,627 sentences. The number of distinct words in NDP is 32,965. This is more than three times higher than the second largest number, in NGP, and almost twenty times higher than in AsgiSA. That may reflect the more diverse issues being discussed in NDP. Topic modeling results highlight the same issue. Of the nine topics identified, NDP has a statistically significant, positive effect on four, NGP on two, and AsgiSA, GEAR, and RDP on one each.
Employment and jobs are prominent in the development plans. From probabilistic topic modeling we identified one topic (out of nine) focusing on issues of jobs and employment. More generally, an employment-related term is the 8th most frequent word in the whole corpus. The top 50 words also contain references to work, jobs, and labour.
In the analysis of National Budget Reviews (NBR) we focused on two concepts, output stabilisation and credibility, each defined through a set of keywords provided by the World Bank. Total mentions of both concepts in the NBR corpus were traced from 1998 to 2017. “Output stabilisation” peaked in the 2009 NBR, while “credibility” reached its peak in 2013. Although “credibility” was generally mentioned more often in NBR over time, there has been an upswing in mentions of “output stabilisation” since 2016, and it overtook “credibility” in the 2017 NBR.
The World Bank Group (WBG) twin goals of ending extreme poverty and promoting shared prosperity reflect a new global landscape: one in which developing countries have an unprecedented opportunity to end extreme poverty within a generation. The WBG will face traditional and new challenges as it works with partners to reach those who live in extreme and moderate poverty. Indeed, many of those who emerged from poverty in recent years remain vulnerable to shocks and slowdowns in growth. Concerted efforts to equalize opportunities are necessary for substantial improvements in shared prosperity.
Reaching the ambitious WBG twin goals will require high and sustained economic growth across the developing world that also translates more effectively into poverty reduction in each country. This kind of robust, sustainable, inclusive growth—that achieves the maximum possible increase in living standards of the less well-off—is not business as usual, and has important implications for the WBG. In particular, the quest for economic growth, poverty reduction and shared prosperity can no longer be seen as separate, nor can policy options be viewed as a trade-off between economic growth and poverty reduction. At the same time, these priorities must be consistent with each country’s economic, social and institutional context and challenges—there is no one-size-fits-all solution. Ultimately, the twin goals demand a sharper, country-specific understanding of the constraints to growth and the trade-offs that available macro and sectoral policy choices entail, to promote substantial improvements in the welfare of the less well-off.
The WBG’s first joint strategy seeks to position the institution to deliver better for its clients by: (1) maximizing development impact by identifying and tackling the most difficult development challenges; (2) promoting scaled-up partnerships strategically aligned with the goals; and (3) convening public and private resources, expertise and ideas.
To identify the most important areas for interventions to achieve the WBG’s twin goals, the WBG conducts a Systematic Country Diagnostic (SCD) preceding the preparation of its Country Partnership Frameworks (laying out the intervention areas for WBG programs). The SCD is an analytical product and in the case of South Africa, it will be prepared in collaboration with the National Planning Commission/Department of Monitoring and Evaluation in the Presidency. The relationship is governed by a Memorandum of Understanding.
As part of the relationship with the National Planning Commission, the World Bank team has committed to examining progress on the National Development Plan. To this end, it is important to understand the evolution of South Africa’s strategic policy direction since democracy in 1994, from the Reconstruction and Development Plan (RDP), Growth, Employment And Redistribution (GEAR), the Accelerated and Shared Growth Initiative South Africa (AsgiSA), and the New Growth Path (NGP) to the National Development Plan. One argument for why South African policy is less effective than desired is that policy is articulated in a blurry way and is becoming increasingly fragmented, with instances of even competing objectives. Less effective policy can in itself hamper progress on the twin goals. A consultant is to be hired to help with a quantitative analysis of the major South African policy documents mentioned above.
The scope of the work covers the application of natural language processing (NLP) to analyze RDP, GEAR, AsgiSA, NGP, and the National Development Plan. In addition, a similar analysis is undertaken for National Budget Reviews from 1998 to 2017 to explore the evolution of emphasis on fiscal policy.
In relation to national development plans the aim is as follows: in the corpus of documents implement probabilistic topic modeling (Latent Dirichlet Allocation models) to identify core themes; estimate a range of models to identify optimal topic structure; visualize the results of topic modeling for interpretation using static graphics; identify semantic relationships between themes in the documents and map these topic correlations; identify key phrases across the documents in the corpus; and identify the relationship between structural factors and themes of the documents.
For the National Budget Reviews analysis: measure commitment to fiscal credibility and countercyclical fiscal policy by looking at the frequency of the keywords and how it evolves over time. The dictionary of keywords is as follows:
output stabilisation: automatic stabilisers, tax buoyancy, countercyclical fiscal policy, cyclically adjusted budget balance, structural budget balance, exchange rate absorption, stimulus package, fiscal stimulus, bracket creep adjustment, rebates increase, expansionary fiscal policy, accommodating fiscal policy
credibility: Sustainability, low risk, meeting targets, low volatility, sustainable fiscal path, low inflation, maximising growth, strong multipliers
Using the dictionary of keywords the aim is to identify the context where these keywords appear in budgets and summarize the context as topics and trace evolution of topics over time.
The texts of national development plans were downloaded from the following links:
New Growth Path collection consists of the following separate booklets that were used as separate documents:
The National Development Plan consists of fifteen chapters. As a single document it is 489 pages long, which is significantly more than any of the previous plans. Hence, for computational reasons, it was included in the analysis as separate chapters rather than one document. Overall, 24 documents were used in the analysis.
All the documents were converted into plain text files. Conversion from PDF to plain text introduced multiple errors, because some historical and no longer supported fonts compromised the conversion. Hence all the documents were spell-checked to capture the most obvious typos.
Using R statistical software, plain text versions of national development plans were ingested and a “corpus” object for analysis was created (see https://en.wikipedia.org/wiki/Text_corpus for general introduction to the concept).
The table below provides summary information for our corpus.
We observe a dramatic increase in the size of the plans over time. This is particularly true for NDP, which stands at 162,056 total words and 6,627 sentences. Another way to look at the diversity of content in the plans is to focus on types, that is, distinct words. NDP has 32,965 types, which is more than three times higher than the second largest number, in NGP, and almost twenty times higher than in AsgiSA. That may reflect the more diverse issues being discussed in NDP.
The corpus was then transformed into separate words (tokens), accompanied by basic pre-processing:
Additionally, any digits and punctuation that may have become part of tokens through mistakes in text conversion and input were removed. Any tokens fewer than three characters long were removed as well. This picks up some additional mistakes and typos. Next, all tokens were converted to lower case.
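The cleaning steps just described can be sketched as follows. This is an illustrative Python version only: the report's analysis was run in R, and the function name `tokenize`, the whitespace split, and the restriction to ASCII letters are assumptions made for the example, not the report's actual code.

```python
import re

def tokenize(text, min_chars=3):
    """Split on whitespace, strip digits/punctuation, drop short tokens, lower-case."""
    tokens = []
    for raw in text.split():
        # Remove anything that is not an ASCII letter (digits, punctuation,
        # conversion debris). Assumption: the texts are English.
        cleaned = re.sub(r"[^A-Za-z]", "", raw)
        # Drop tokens shorter than three characters.
        if len(cleaned) >= min_chars:
            tokens.append(cleaned.lower())
    return tokens

sample = "The RDP-1994 aims at re7construction, of 3 sectors!"
print(tokenize(sample))
# ['the', 'rdp', 'aims', 'reconstruction', 'sectors']
```

One consequence of stripping characters inside tokens is that hyphenated compounds are fused into a single token; splitting on hyphens first would be an equally defensible choice.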
A document feature matrix (also known as a document term matrix), or DFM, is a fundamental input into natural language processing (see https://en.wikipedia.org/wiki/Document-term_matrix). We construct a DFM from tokens after stemming and removing “stop words” (common function words that carry little substantive meaning) using the SMART list.
The DFM is trimmed by dropping tokens that appear fewer than three times, mainly to catch typos and text conversion mistakes. The logic is that a token used only once or twice across all documents is either a feature that does not distinguish well between documents, or a spelling mistake or typo. The total number of distinct tokens (3135) shows the size of the trimmed DFM that we use in the analysis.
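A minimal Python sketch of building and trimming such a matrix is given below. It is an illustration under stated assumptions: `STOP_WORDS` is a tiny stand-in rather than the SMART list, stemming is omitted, and the names `build_dfm` and `trim_dfm` are hypothetical.

```python
from collections import Counter

# Illustrative stand-in only; NOT the SMART stop word list used in the report.
STOP_WORDS = {"the", "and", "for", "that", "with"}

def build_dfm(docs_tokens):
    """Return a dict: document name -> Counter of term counts (stop words skipped)."""
    return {name: Counter(t for t in toks if t not in STOP_WORDS)
            for name, toks in docs_tokens.items()}

def trim_dfm(dfm, min_termfreq=3):
    """Drop terms whose total count across all documents is below min_termfreq."""
    totals = Counter()
    for counts in dfm.values():
        totals.update(counts)
    keep = {term for term, n in totals.items() if n >= min_termfreq}
    return {name: Counter({t: n for t, n in counts.items() if t in keep})
            for name, counts in dfm.items()}

docs = {
    "A": ["growth", "growth", "the", "job", "typo"],
    "B": ["growth", "job", "job", "and"],
}
trimmed = trim_dfm(build_dfm(docs))
print(trimmed)  # "typo" occurs once in total and is dropped
```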
For the corpus as a whole, we can assess the most frequently occurring terms; the visualisation below focuses on the 20 most frequent words.
For convenience, the same information is presented as a traditional word cloud (with 100 most frequent terms).
We can also assess differences in frequency of word usage by development plans. This highlights the evolution of most frequently occurring terms (and thus saliency of the terms) over time.
Wordcloud plot of 100 most frequent terms in RDP:
We can explore which key terms appear in RDP more frequently than by chance using the concept of [keyness](https://en.wikipedia.org/wiki/Keyword_(linguistics)). We calculate keyness for RDP compared to all other documents in our corpus (the remaining development plans). The outputs are sorted in descending order by the association measure (chi2 here). The figure below visualises keyness between RDP and the other development plans:
Terms on the right (in red) are words that appear significantly more frequently in RDP than would be expected by chance compared to all other national development plans, for example “democrat”, “reconstruct”, “programm”, and “apartheid”. At the same time, the terms on the left (in blue) appear less frequently than in other documents, e.g., “growth”, “target”, “spatial”.
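To make the chi2 association measure concrete, the sketch below computes a signed Pearson chi-squared statistic for one term from a 2x2 table of counts (target document versus reference documents). It is a Python illustration with hypothetical counts; it applies no continuity correction, whereas the software used for the report may apply one, so values need not match the report's figures.

```python
def signed_chi2(a, b, c, d):
    """Signed Pearson chi-squared for a 2x2 keyness table.

    a: count of the term in the target document
    b: count of the term in the reference documents
    c: count of all other tokens in the target document
    d: count of all other tokens in the reference documents
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Positive when the term is relatively more frequent in the target
    # (a / (a + c) > b / (b + d) is equivalent to a * d > b * c).
    return chi2 if a * d >= b * c else -chi2

# Hypothetical counts: 30 uses in a 1,000-token target, 20 in a 9,000-token reference.
print(round(signed_chi2(30, 20, 970, 8980), 2))
```

Sorting terms by this value in descending order puts the terms most characteristic of the target first and the terms it under-uses last.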
In GEAR more prominent terms are around wages and employment, deficit, expenditure, and fiscal issues.
AsgiSA’s most prominent terms are project acronyms, pointing to a focus on institutions rather than policy.
NGP introduces a set of commitments and references to green growth and jobs.
NDP is notable in that it has relatively fewer mentions of economic growth, employment, and real-sector terms. What sets NDP apart from previous development plans is its emphasis on health, the low-carbon economy, and corruption-related issues.
In the topic model analysis we consider the thematic structure of development plans and effect of structural variables. Given limited number of documents in this part of the analysis, we are only looking at thematic differences across documents as structural effects. That is whether themes change across development plans. In order to achieve that we implement a structural topic model (Roberts et al., 2016). We model topic prevalence in the context of the development plan covariate (a factor variable with a level for each individual plan). The aim is to statistically test whether the metadata affects the frequency with which a topic is discussed in development plans – the average proportion of a document discussing a topic.
The structural topic model, or STM (Roberts et al., 2016), is a type of probabilistic topic model (Blei et al. 2003) that allows the effect of covariates to be assessed (see http://www.structuraltopicmodel.com; for an introduction and a nice overview of topic modeling see http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf).
One key input into the topic modeling algorithm is specifying the number of topics the algorithm needs to uncover in the corpus. This can be done with a manual input, using human expert judgement to determine the number of topics. Alternatively, this can be done by focusing on semantic coherence (see Mimno et al., 2011) and exclusivity (see Bischof and Airoldi, 2012) measures. Highly frequent words in a given topic that don’t appear too often in other topics are said to make that topic exclusive. Cohesive and exclusive topics are more semantically useful.
We first generate a set of candidate models, here ranging between 3 and 30 topics. Then we plot exclusivity and semantic coherence estimates for each candidate model and choose the optimal number of topics as a balance between these two measures (see Roberts et al., 2016).
The plot below maps exclusivity against semantic coherence (numbers closer to zero indicate higher coherence). We select a model on the semantic coherence-exclusivity “frontier” (the set of models that no other model strictly dominates in terms of both semantic coherence and exclusivity).
The model with nine topics is selected for our analysis (highlighted with vertical line). There’s a sharp drop in semantic coherence after \(k=9\).
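The idea of a semantic coherence-exclusivity frontier can be sketched in a few lines of Python. The coherence and exclusivity values below are invented purely for illustration and are not the report's estimates; `frontier` is a hypothetical helper name.

```python
def frontier(models):
    """Return the candidate k values that no other candidate dominates.

    models: dict mapping number of topics k -> (semantic_coherence, exclusivity),
    where higher is better on both measures.
    """
    keep = []
    for k, (coh, exc) in models.items():
        dominated = any(
            c2 >= coh and e2 >= exc and (c2 > coh or e2 > exc)
            for k2, (c2, e2) in models.items()
            if k2 != k
        )
        if not dominated:
            keep.append(k)
    return sorted(keep)

# Hypothetical (coherence, exclusivity) pairs for five candidate models.
candidates = {
    3: (-40.0, 8.9),
    6: (-45.0, 9.3),
    9: (-48.0, 9.6),
    12: (-70.0, 9.5),
    15: (-75.0, 9.7),
}
print(frontier(candidates))
```

In this made-up example the 12-topic model is dominated by the 9-topic model on both measures, so it drops off the frontier. Choosing among the remaining candidates is still a judgement call, for instance picking the last model before coherence falls sharply.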
One way to summarize topics is to combine term frequency and exclusivity to that topic into a univariate summary statistic. In the STM package in R this is implemented as FREX (see Bischof and Airoldi, 2012 and Airoldi and Bischof, 2016). The logic behind this measure is that both frequency and exclusivity are important factors in determining the semantic content of a word and form a two-dimensional summary of topical content. FREX is the weighted harmonic mean of a word’s rank in terms of frequency and exclusivity and can be viewed as a univariate measure of topical importance.
STM authors suggest that nonexclusive words are less likely to carry topic-specific content, while infrequent words occur too rarely to form the semantic core of a topic. FREX is therefore combining information from the most frequent words in the corpus that are also likely to have been generated from the topic of interest to summarize its content.
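As a rough illustration of how frequency and exclusivity are combined, the Python sketch below follows the formulation in Bischof and Airoldi (2012), where FREX is a weighted harmonic mean of the empirical-CDF ranks of a word's within-topic probability and its exclusivity. The topic-word probabilities are invented, the weight `w` is set to 0.5 as an assumption, and this is not the R package's own code.

```python
def ecdf_scores(values):
    """Map each value to the share of values less than or equal to it."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]

def frex(topic_word_probs, topic, w=0.5):
    """FREX-style scores for one topic.

    topic_word_probs: list of topics, each a list of word probabilities
    in the same vocabulary order. w: weight on exclusivity (assumed 0.5).
    """
    beta_k = topic_word_probs[topic]
    n_words = len(beta_k)
    # Exclusivity: the topic's share of a word's probability across all topics.
    exclusivity = [beta_k[v] / sum(t[v] for t in topic_word_probs)
                   for v in range(n_words)]
    f = ecdf_scores(beta_k)
    e = ecdf_scores(exclusivity)
    return [1.0 / (w / e[v] + (1 - w) / f[v]) for v in range(n_words)]

vocab = ["develop", "carbon", "growth", "coal"]
beta = [
    [0.40, 0.30, 0.20, 0.10],  # hypothetical topic 0
    [0.40, 0.01, 0.58, 0.01],  # hypothetical topic 1
]
scores = frex(beta, topic=0)
print(sorted(zip(vocab, scores), key=lambda pair: -pair[1]))
```

In this toy example “develop” is the most probable word in topic 0 but is shared equally with topic 1, so “carbon”, which is both fairly frequent and nearly exclusive, receives the highest score.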
The table below presents four types of word weightings using alternative measures. Highest probability words are the words within each topic with the highest probability. FREX words are ranked by the FREX measure discussed above. Lift is calculated by dividing the topic-word distribution by the empirical word count probability distribution. Sievert and Shirley (2014) point out that the lift measure (Taddy 2011) aims to de-rank high-frequency terms, but in practice it often gives high ranking to very rare terms occurring in only a single topic. Score is a metric used in the lda R package by Jonathan Chang.
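The behaviour of lift with rare terms is easy to see in a small Python sketch. The probabilities and corpus counts are hypothetical, and `lift` is an illustrative helper rather than any package's function.

```python
def lift(topic_probs, word_counts):
    """Divide each word's topic probability by its empirical corpus probability."""
    total = sum(word_counts)
    return [p / (c / total) for p, c in zip(topic_probs, word_counts)]

# Hypothetical vocabulary of three words: a common word, a very rare word,
# and everything else lumped into a third entry.
topic_probs = [0.4, 0.002, 0.598]
word_counts = [5000, 2, 4998]
values = lift(topic_probs, word_counts)
print(values)
```

The common word has a lift below 1, while the word that occurs only twice in the corpus gets a lift of about 10, which is why lift lists tend to be dominated by rare terms.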
In practice, manual topic labeling is usually evaluated through a combination of the metrics below.
Topic 1 Top Words:
Highest Prob: growth, employ, develop, south, path, job, economi, servic, econom, sector
FREX: driver, mine, path, growth, employ, framework, job, firm, creation, export
Lift: cwp, merger, spike, arbitr, buyer, curs, cushion, dfis, dst, edd
Score: speech, mine, export, dfis, edd, region, dismiss, nanc, diversifi, bee
Topic 2 Top Words:
Highest Prob: educ, school, develop, system, higher, nation, train, percent, qualiti, skill
FREX: school, teacher, scienc, teach, learner, educ, student, learn, knowledg, higher
Lift: postgradu, checklist, dropout, funza, interv, lushaka, phds, tongu, advisor, diploma
Score: teacher, mathemat, phd, learner, phds, certif, teach, percent, scienc, underperform
Topic 3 Top Words:
Highest Prob: health, social, south, system, care, servic, communiti, africa, percent, work
FREX: health, diseas, hiv, care, child, demograph, age, popul, insur, mortal
Lift: addict, conceptu, condom, gov, intersector, physician, pictur, rica, therapeut, antibiot
Score: mortal, matern, percent, age, health, nhi, death, hiv, hospit, insur
Topic 4 Top Words:
Highest Prob: train, skill, govern, develop, growth, busi, sector, commit, improv, programm
FREX: fet, artisan, traine, asgisa, skill, colleg, train, workplac, project, enrol
Lift: bpo, jipsa, recogn, umsobomvu, apprentic, asgisa, traine, fet, dead, eia
Score: fet, apprentic, asgisa, traine, jipsa, bpo, seta, dti, recogn, umsobomvu
Topic 5 Top Words:
Highest Prob: accord, commit, economi, local, green, youth, govern, busi, develop, procur
FREX: green, accord, procur, solar, constitu, localis, commit, youth, heat, instal
Lift: blsa, bought, cook, decemb, feder, hat, incandesc, jacket, mthalan, mxolisi
Score: accord, green, geyser, behalf, cop, heat, constitu, solar, localis, decemb
Topic 6 Top Words:
Highest Prob: develop, govern, programm, nation, rdp, communiti, south, polici, peopl, servic
FREX: rdp, democrat, reconstruct, hous, right, land, apartheid, legisl, rural, cultur
Lift: thoroughgo, abe, alli, amen, applianc, audio, captain, cbos, conglomer, councillor
Score: rdp, democrat, democratis, cent, reconstruct, media, right, parastat, peac, hostel
Topic 7 Top Words:
Highest Prob: south, africa, develop, econom, invest, polici, water, region, servic, transport
FREX: carbon, coal, mitig, spatial, ict, emiss, fuel, transport, gas, climat
Lift: angola, apport, augment, captiv, cleaner, combust, converg, crippl, des, desalin
Score: carbon, coal, region, refineri, reus, climat, emiss, mitig, corridor, bay
Topic 8 Top Words:
Highest Prob: public, servic, govern, municip, respons, depart, develop, polic, south, manag
FREX: corrupt, polic, recruit, municip, crime, safeti, justic, crimin, servant, soe
Lift: aptitud, blow, counterproduct, disagr, downgrad, freeli, lang, meritocrat, politician, prosecutor
Score: recruit, soe, whistl, polic, corrupt, servant, blower, deleg, junior, judici
Topic 9 Top Words:
Highest Prob: percent, growth, employ, increas, sector, rate, labour, market, year, polici
FREX: real, wage, exchang, deficit, fiscal, inflat, farmer, gdp, foreign, depreci
Lift: outward, spot, tenth, elast, semi, aggreg, agribusi, appendic, apr, aug
Score: page, percent, depreci, elast, exchang, dissav, macroeconom, assa, appendix, expenditur
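Manually assessing the word weightings across the four metrics above, we can introduce the following labels:

| Topic | Label |
|---|---|
| 1 | Economic growth |
| 2 | Education |
| 3 | Health |
| 4 | Skills and training |
| 5 | Green economy |
| 6 | Reconstruction and democracy |
| 7 | Climate change and resources |
| 8 | Corruption and security |
| 9 | Fiscal policy and macroeconomy |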
This labeling is an outsider interpretation of the word weightings and will necessarily change with more domain expertise brought in to label the topics.
Figure below displays the topics ordered by their expected frequency across the corpus, with illustrative top FREX words.
In the STM framework we can estimate the effect of external covariates. As mentioned above, here external covariates are limited to differences across development programmes. This is due to the fact that it’s difficult to unambiguously attribute economic indicators like inflation or unemployment rates to documents that span several years in preparation and implementation.
Estimation is done with a linear regression where documents are the units, the outcome is the proportion of each document about a topic in an STM model and the covariate is the factor variable for national development programmes. Estimation incorporates measurement uncertainty from the STM model using the method of composition.
Plots below display the effect of our covariate on each estimated topic. The covariate is a nominal five-level factor variable for each development plan. We estimate mean topic proportions for each value of the covariate, with corresponding 95% confidence intervals of the effect.
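With a single categorical covariate, the fitted values of such a regression are simply the per-plan means of the document-level topic proportions. The Python sketch below shows only that point estimate, using invented proportions; it leaves out the uncertainty propagation (method of composition) and the confidence intervals described above, and `mean_topic_proportion` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean

def mean_topic_proportion(theta, plans, topic):
    """Average the proportion of one topic over the documents of each plan.

    theta: list of per-document topic proportion lists
    plans: plan label for each document, in the same order
    """
    groups = defaultdict(list)
    for doc_props, plan in zip(theta, plans):
        groups[plan].append(doc_props[topic])
    return {plan: mean(vals) for plan, vals in groups.items()}

# Hypothetical proportions for four documents and two topics.
plans = ["RDP", "NDP", "NDP", "NGP"]
theta = [[0.7, 0.3], [0.2, 0.8], [0.4, 0.6], [0.5, 0.5]]
print(mean_topic_proportion(theta, plans, topic=1))
```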
Topics 1 (Economic growth) and 2 (Education) appear in all development programmes in similar proportions highlighting their stable importance over time.
Topic 3 (Health) is given higher attention in the NDP compared to other development programmes.
Topic 4 (Skills and training) is given higher attention in AsgiSA than in other programmes, with the exception of NGP, which also has larger coverage of the topic, albeit only marginally so in statistical terms.
Topic 5 (Green economy) is given higher attention in NGP compared to other development programmes.
Topic 6 (Reconstruction and democracy) has a high coverage in RDP, as would be expected from the early national development plan.
Topics 7 (Climate change and resources) and 8 (Corruption and security) have higher coverage in the most recent national development programme (NDP).
Topic 9 (Fiscal policy and macroeconomy) has higher emphasis in the 1996 GEAR programme and, statistically marginally, in NDP compared to other development plans.
Topic modeling results highlight the diversity of issues mentioned earlier. Of the nine topics identified, NDP has a statistically significant, positive effect on four, NGP on two, and AsgiSA, GEAR, and RDP on one each.
We can assess the relationship between topics in the STM framework that allows correlations between topics. We calculate the correlation between estimates of the topic proportions and drop edges below the correlation threshold 0.01. Positive correlations between topics suggest that both topics are likely to be covered within a development programme.
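The thresholding step can be sketched in Python as follows: compute pairwise Pearson correlations between the columns of a document-by-topic proportion matrix and keep only pairs above the cutoff. The matrix is invented for illustration, and `topic_edges` is a hypothetical helper, not the function used in the report.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def topic_edges(theta, threshold=0.01):
    """Return (topic_i, topic_j, correlation) for pairs above the threshold."""
    k = len(theta[0])
    cols = [[doc[j] for doc in theta] for j in range(k)]
    edges = []
    for i in range(k):
        for j in range(i + 1, k):
            r = pearson(cols[i], cols[j])
            if r > threshold:
                edges.append((i, j, r))
    return edges

# Hypothetical topic proportions: four documents, three topics (rows sum to 1).
theta = [
    [0.50, 0.40, 0.10],
    [0.40, 0.35, 0.25],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
]
print(topic_edges(theta))
```

Because topic proportions within a document sum to one, most pairs are negatively correlated by construction, so a low positive cutoff already removes the majority of possible edges.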
Two sets of topics are connected with each other: Topics 1 and 9, and Topics 2 and 5. We can contrast the words across two connected topics by calculating the difference in probability of a word for the two topics, and normalizing the maximum difference in probability of any word between the two topics. These are often called perspective plots, where words are sized proportional to their use within the plotted topic combinations and oriented along the X-axis based on how much they favour each of the topics. The vertical configuration of the words is random.
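The horizontal position of a word in such a plot can be sketched as the difference in its probability under the two topics, divided by the largest absolute difference over all words. The Python example below uses invented probabilities; `contrast` is a hypothetical name.

```python
def contrast(beta_a, beta_b):
    """Normalised probability differences in [-1, 1].

    Positive values favour topic A, negative values favour topic B,
    and values near zero mark words shared by both topics.
    """
    diffs = [a - b for a, b in zip(beta_a, beta_b)]
    scale = max(abs(d) for d in diffs)
    if scale == 0:
        return [0.0] * len(diffs)
    return [d / scale for d in diffs]

vocab = ["growth", "fiscal", "servic"]
beta_a = [0.5, 0.125, 0.375]   # hypothetical word probabilities, topic A
beta_b = [0.25, 0.375, 0.375]  # hypothetical word probabilities, topic B
print(dict(zip(vocab, contrast(beta_a, beta_b))))
```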
Intuitively, topics related to economic growth (Topic 1) and fiscal policy and macroeconomy (Topic 9) are related. However, the perspective plot below shows the relative emphases on different aspects across the topics. The words that straddle the probabilistic boundary between the two topics relate to services and the labour market. The words more central to individual topics highlight aspects of economic development policies.
Topics related to education (Topic 2) and green economy (Topic 5) are also more likely to be covered in the same development programme.
The data for this analysis come from the National Budget Reviews. We downloaded all chapters (but not the appendices) and converted them into plain text files with UTF-8 encoding.
The table below provides summary information for our corpus.
We also pre-processed the corpus following the same steps as above. In addition we also removed “cent” and “billion” from the corpus as, in our setting, these were high frequency non-function words.
Total number of tokens in NBR DFM is 4023. The most frequently occurring terms are shown below.
Visualised as a wordcloud:
We assess two concepts: output stabilisation and credibility. Each concept is described by a list of keywords listed below that were provided by the World Bank:
output stabilisation: automatic stabilizers, tax buoyancy, countercyclical fiscal policy, cyclically adjusted budget balance, structural budget balance, exchange rate absorption, stimulus package, fiscal stimulus, bracket creep adjustment, rebates increase, expansionary fiscal policy, accommodating fiscal policy
credibility: sustainability, low risk, meeting targets, low volatility, sustainable fiscal path, low inflation, maximizing growth, strong multipliers
With the necessary adjustments for multiple word forms and the spelling used in NBR, the dictionary of keywords used was as follows (a “*” character indicates a wild-card, i.e. any word ending):
Dictionary object with 2 key entries.
- [output_stabilisation]:
- automatic stabiliser*, tax buoyancy, countercyclical fiscal polic*, cyclically adjusted budget balance*, structural budget balance*, exchange rate absorption, stimulus package*, fiscal stimul*, bracket creep adjustment*, rebate* increase*, expansionary fiscal polic*, accommodating fiscal polic*
- [credibility]:
- sustainability, low risk*, meeting target*, low volatility, sustainable fiscal path, low inflation, maximis* growth, strong multiplier*
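A dictionary with trailing wild-cards of this kind can be applied with regular expressions, as in the Python sketch below. It uses only a subset of the entries above, treats `*` as "any word ending", and counts matches case-insensitively; the sample sentence and the function names are made up for the illustration and this is not the report's R code.

```python
import re

# Subset of the dictionary above, for illustration only.
DICTIONARY = {
    "output_stabilisation": ["automatic stabiliser*", "fiscal stimul*", "stimulus package*"],
    "credibility": ["sustainability", "low inflation", "meeting target*"],
}

def compile_entry(entry):
    """Turn a phrase such as 'fiscal stimul*' into a case-insensitive regex."""
    words = [re.escape(w).replace(r"\*", r"\w*") for w in entry.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)

def count_concepts(text, dictionary=DICTIONARY):
    """Count how many times each concept's keywords occur in the text."""
    return {
        concept: sum(len(compile_entry(e).findall(text)) for e in entries)
        for concept, entries in dictionary.items()
    }

text = ("Automatic stabilisers and a fiscal stimulus supported demand, "
        "while low inflation and fiscal sustainability were maintained.")
print(count_concepts(text))
```

Note that overlapping entries are counted separately in this sketch: a phrase such as “fiscal stimulus package” would match both “fiscal stimul*” and “stimulus package*”.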
Frequency of both concepts appearing in NBR is presented in table below.
The same information is visualised in the plot below:
We can assess the linkage between development plans and NBRs by calculating similarities between documents. A simple similarity measure that normalises for document length is cosine similarity. First, we weighted the Document Feature Matrix using TF-IDF (term frequency, inverse document frequency) weights. The weight increases in proportion to the number of times a term appears in a document and is offset by how widely the term occurs across the corpus, so that terms common to most documents count for less. It is a standard weighting scheme in information retrieval. Second, we view documents as vectors in a vector space. The cosine of the angle between two vectors is a measure of their similarity, and is likewise standard in information retrieval. In this setting, cosine similarity ranges between 0 and 1, where 0 means that the documents are orthogonal (they share no weighted terms) and 1 means that the documents have identical relative term weights.
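The two steps can be sketched in Python as follows. The counts are invented, the inverse document frequency is taken as the base-10 logarithm of (number of documents / number of documents containing the term), which is one common variant and an assumption here, and the function names are hypothetical.

```python
from math import log10, sqrt

def tfidf(dfm):
    """Weight raw counts by log10(N / document frequency).

    dfm: dict mapping document name -> dict of term -> count.
    """
    n_docs = len(dfm)
    doc_freq = {}
    for counts in dfm.values():
        for term in counts:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {doc: {t: c * log10(n_docs / doc_freq[t]) for t, c in counts.items()}
            for doc, counts in dfm.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term counts for three documents.
dfm = {
    "NDP": {"health": 3, "growth": 2},
    "NBR": {"growth": 4, "tax": 5},
    "RDP": {"democrat": 6, "health": 1},
}
weights = tfidf(dfm)
print(round(cosine(weights["NDP"], weights["NBR"]), 4))
print(cosine(weights["NBR"], weights["RDP"]))  # no shared terms
```

Under this weighting a term that appears in every document gets weight zero and contributes nothing to the similarity, which is part of why the similarities between long policy documents can be small.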
The tables and plots below provide cosine similarity measures for each development plan and full set of National Budget Reviews.
| NBR | Cosine similarity with RDP.1994 |
|---|---|
| NBR.1998 | 0.1927 |
| NBR.2000 | 0.1291 |
| NBR.2005 | 0.1279 |
| NBR.2004 | 0.1264 |
| NBR.2002 | 0.1135 |
| NBR.2003 | 0.1057 |
| NBR.2001 | 0.0968 |
| NBR.2007 | 0.0964 |
| NBR.2010 | 0.0943 |
| NBR.1999 | 0.0933 |
| NBR.2009 | 0.0837 |
| NBR.2012 | 0.083 |
| NBR.2006 | 0.0769 |
| NBR.2014 | 0.0727 |
| NBR.2008 | 0.0721 |
| NBR.2017 | 0.0707 |
| NBR.2011 | 0.0695 |
| NBR.2013 | 0.0633 |
| NBR.2015 | 0.0598 |
| NBR.2016 | 0.058 |
| NBR | Cosine similarity with GEAR.1996 |
|---|---|
| NBR.2006 | 0.1035 |
| NBR.2004 | 0.1012 |
| NBR.2010 | 0.1009 |
| NBR.1998 | 0.0844 |
| NBR.2000 | 0.0833 |
| NBR.2003 | 0.0801 |
| NBR.2002 | 0.0797 |
| NBR.2007 | 0.0734 |
| NBR.2005 | 0.0727 |
| NBR.2001 | 0.0682 |
| NBR.2008 | 0.0654 |
| NBR.2015 | 0.0639 |
| NBR.2016 | 0.0565 |
| NBR.1999 | 0.0504 |
| NBR.2009 | 0.0497 |
| NBR.2014 | 0.0488 |
| NBR.2013 | 0.0488 |
| NBR.2012 | 0.0451 |
| NBR.2011 | 0.0403 |
| NBR.2017 | 0.0385 |
| NBR | Cosine similarity with AsgiSA.2006 |
|---|---|
| NBR.2001 | 0.0809 |
| NBR.2002 | 0.0602 |
| NBR.2008 | 0.0522 |
| NBR.2007 | 0.0504 |
| NBR.2016 | 0.0465 |
| NBR.2009 | 0.042 |
| NBR.2015 | 0.0417 |
| NBR.2006 | 0.0393 |
| NBR.1999 | 0.039 |
| NBR.2010 | 0.0323 |
| NBR.2005 | 0.0321 |
| NBR.2000 | 0.0316 |
| NBR.2003 | 0.0315 |
| NBR.1998 | 0.0286 |
| NBR.2017 | 0.0283 |
| NBR.2013 | 0.0281 |
| NBR.2014 | 0.0252 |
| NBR.2004 | 0.0237 |
| NBR.2011 | 0.0237 |
| NBR.2012 | 0.0201 |
| NBR | Cosine similarity with NGP.2010 |
|---|---|
| NBR.2017 | 0.1437 |
| NBR.2003 | 0.1024 |
| NBR.1998 | 0.0961 |
| NBR.2001 | 0.0852 |
| NBR.2014 | 0.0812 |
| NBR.2000 | 0.0743 |
| NBR.2005 | 0.0723 |
| NBR.2013 | 0.0672 |
| NBR.2006 | 0.0655 |
| NBR.2015 | 0.0623 |
| NBR.2010 | 0.0593 |
| NBR.2004 | 0.0536 |
| NBR.2011 | 0.0501 |
| NBR.1999 | 0.0496 |
| NBR.2002 | 0.0463 |
| NBR.2009 | 0.0451 |
| NBR.2016 | 0.043 |
| NBR.2012 | 0.0427 |
| NBR.2007 | 0.0398 |
| NBR.2008 | 0.0382 |
| NBR | Cosine similarity with NDP.2012 |
|---|---|
| NBR.2011 | 0.2032 |
| NBR.2002 | 0.2009 |
| NBR.2017 | 0.1925 |
| NBR.2013 | 0.1732 |
| NBR.2016 | 0.1677 |
| NBR.2006 | 0.1526 |
| NBR.2007 | 0.1512 |
| NBR.2004 | 0.1464 |
| NBR.2009 | 0.1459 |
| NBR.2010 | 0.1404 |
| NBR.2008 | 0.1387 |
| NBR.2001 | 0.134 |
| NBR.2014 | 0.1337 |
| NBR.1999 | 0.1325 |
| NBR.1998 | 0.1307 |
| NBR.2015 | 0.129 |
| NBR.2005 | 0.1258 |
| NBR.2000 | 0.1244 |
| NBR.2003 | 0.1127 |
| NBR.2012 | 0.1092 |
Sievert, C. and K. Shirley. “LDAvis: A method for visualizing and interpreting topics.” Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pages 63–70, Baltimore, Maryland, USA, June 27, 2014.
Jonathan M. Bischof and Edoardo M. Airoldi. 2012. “Summarizing topical content with word frequency and exclusivity”. ICML.
David M. Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. 2003. “Hierarchical Topic Models and the Nested Chinese Restaurant Process.” NIPS.
David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. “Optimizing Semantic Coherence in Topic Models.” EMNLP.
Matthew A. Taddy 2011. “On Estimation and Selection for Topic Models.” AISTATS.
Roberts, Margaret E, Brandon M Stewart and Edoardo M Airoldi. 2016. “A model of text for experimentation in the social sciences.” Journal of the American Statistical Association 111(515):988-1003.
Airoldi, Edoardo M. and Jonathan M. Bischof. 2016. “Improving and evaluating topic models and other models of text (with discussion).” Journal of the American Statistical Association 111(516):1381-1403.